Machine Learning for Peace:

Infrastructure and Applications

PDRI-DevLab

University of Pennsylvania

October 30, 2024

Jeremy Springman

Research Assistant Professor

Erik Wibbels, Serkant Adiguzel, Mateo Villamizar Chaparro, Zung-Ru Lin, Donald Moratz, Diego Romero, Hanling Su, Mahda Soltani

Overview


  1. Introducing Machine Learning for Peace
  2. Application: Forecasting Travel Advisories
  3. MLEED Extensions and Estimation

Introducing MLP Infrastructure

Background

  • Event data are key for understanding political dynamics
  • Media reports are the most comprehensive documentation
  • Positive: AI/ML provides new tools to extract information from text
  • Negative: Existing media corpora have poor coverage of domestic media outlets in developing countries
  • MLP provides a high-quality corpus and flexible text-processing infrastructure

High Quality Corpus

Input: Online news

  • 400+ news sources
  • 47 languages
  • 120 million articles

Data quality

  • Focus on high-quality local sources (medium data)
  • Direct, human-monitored scraping
  • Much better coverage than existing archives and aggregators (GDELT, Wayback Machine, LexisNexis, etc.)



Output: Monthly data

  • 63 countries
  • 2012 through the most recent month

Detecting Civic Space Events in Text

  • Robustly Optimized BERT Pretraining Approach (RoBERTa)
  • Pre-trained on enormous corpora of data + transfer learning
  • Fine-tuned on double human-coded training data (n=9,875)
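The step from article-level classifications to the monthly country-level output can be sketched roughly as follows. This is a minimal illustration, not the MLP pipeline itself: the record layout, the event labels, and the `no_event` placeholder are all assumptions made for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical classifier output: (country, publication date, predicted label).
# These records and label names are invented for illustration; they are not MLP data.
articles = [
    ("Bangladesh", date(2024, 4, 3), "protest"),
    ("Bangladesh", date(2024, 4, 17), "protest"),
    ("Bangladesh", date(2024, 5, 2), "arrest"),
    ("Albania", date(2024, 4, 9), "censorship"),
    ("Albania", date(2024, 4, 21), "no_event"),
]


def monthly_event_counts(records, skip_label="no_event"):
    """Count predicted events per (country, year-month, label)."""
    counts = Counter()
    for country, pub_date, label in records:
        if label == skip_label:
            continue  # articles without a detected event do not add to any count
        counts[(country, pub_date.strftime("%Y-%m"), label)] += 1
    return counts


counts = monthly_event_counts(articles)
for key in sorted(counts):
    print(key, counts[key])
```

Raw counts are used here for simplicity; whether counts are normalized by total article volume is not stated on the slide.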

Data Processing Pipeline


Event Detection

Forecasting Travel Advisories

Why Travel Advisories?

  • Request from US State Department
    • High-level travel advisories trigger deployment of resources
    • Anticipating the location and timing of advisories can help smooth budgets
  • Travel advisories include political instability, natural disasters, health risks, etc.

Data

  • Target: onset of a serious travel advisory
  • Predictors: MLP data, indicator for continued advisories, years, time trend, Bayesian country encoding
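One way the onset target and the continued-advisory indicator could be coded from a monthly series of advisory levels is sketched below. The threshold of 3 for a "serious" advisory, the treatment of the first observed month, and the choice to flag an onset anywhere within the forecast horizon are all assumptions for illustration; the slides do not specify them.

```python
def code_onsets(levels, threshold=3):
    """Return (onset, continuing) 0/1 lists for one country's monthly advisory levels.

    onset[t] = 1 when the level is at or above `threshold` in month t but not in t-1.
    continuing[t] = 1 when the level is at or above `threshold` in both t-1 and t.
    Assumption: a high level in the first observed month is counted as an onset.
    """
    onset, continuing = [], []
    prev_high = False
    for level in levels:
        high = level >= threshold
        onset.append(int(high and not prev_high))
        continuing.append(int(high and prev_high))
        prev_high = high
    return onset, continuing


def lead_target(onset, horizon):
    """target[t] = 1 if an onset occurs in months t+1 .. t+horizon.

    Months whose forward window runs past the end of the data get None.
    """
    n = len(onset)
    target = []
    for t in range(n):
        if t + horizon > n - 1:
            target.append(None)  # window extends past the observed data
        else:
            target.append(int(any(onset[t + 1 : t + 1 + horizon])))
    return target


# Hypothetical advisory levels for one country over eight months.
levels = [1, 2, 3, 3, 2, 4, 4, 1]
onset, continuing = code_onsets(levels)
print(onset)
print(continuing)
print(lead_target(onset, horizon=3))
```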

Modeling

  • Forecast Horizon: 3 and 6 months
  • Model: LightGBM + Temporal CV
  • Hyperparameters: Wide grid search for learning rate, proportion of features, depth of trees
  • Evaluation Metrics:
    • ROC-AUC: ranking months in test set
    • AUC-PR: more informative than ROC-AUC when data are imbalanced
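The evaluation setup can be illustrated with a small, dependency-free sketch: an expanding-window temporal split, plus the two metrics computed by hand. The fold layout, function names, and toy scores are assumptions for the example, and model fitting (LightGBM in the slides) is left out entirely. Average precision is used here as the usual summary of the precision-recall curve.

```python
def expanding_window_splits(months, n_folds, test_size):
    """Yield (train_months, test_months); every training month precedes every test month."""
    months = sorted(set(months))
    for fold in range(n_folds):
        test_end = len(months) - (n_folds - 1 - fold) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough months for the requested folds")
        yield months[:test_start], months[test_start:test_end]


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision at the rank of each positive; tied scores keep input order."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Twelve hypothetical months, three folds with two test months each.
months = list(range(1, 13))
splits = list(expanding_window_splits(months, n_folds=3, test_size=2))
for train, test in splits:
    print(f"train months {train[0]}-{train[-1]}, test months {test}")

# Toy test-set labels (2 onsets in 8 months) and model scores.
y_true = [0, 0, 1, 0, 1, 0, 0, 0]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.30, 0.05, 0.15]
print("ROC-AUC:", round(roc_auc(y_true, scores), 3))
print("Average precision:", round(average_precision(y_true, scores), 3))
```

In the toy data a single negative ranked above one positive lowers average precision more than ROC-AUC, which is the sense in which the precision-recall view is more sensitive when positives are rare.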

Performance

Feature Importance

MLEED Extensions

  • New classification model to detect environmental events
  • Geographic granularity down to ADM2
    • Current ability for ADM1
    • GPT needed for ADM2
  • Higher-frequency data is feasible

Estimating the Relationship Between Climate Events and Political Events

  • Unit of analysis: ADM2-months
  • Forecasting using temporal cross-validation
    • Correlation of short-term climate events and long-term climate conditions with the frequency of political events
  • Estimating causal effects
    • Effect of short-term weather shocks on political events
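The slides do not say which estimator is planned for the causal analysis. As one common option for a panel of ADM2-months, the sketch below computes a two-way fixed effects slope by demeaning within units and within months. The estimator choice, the balanced-panel requirement, and all data values are assumptions for illustration.

```python
def twfe_slope(panel):
    """Two-way fixed effects slope of y on x for a balanced panel.

    panel maps (unit, month) -> (x, y). Both variables are demeaned by unit and by
    month (adding back the grand mean), then y is regressed on x without an intercept.
    """
    units = sorted({u for u, _ in panel})
    months = sorted({m for _, m in panel})
    if len(panel) != len(units) * len(months):
        raise ValueError("panel must be balanced")

    def demean(idx):
        vals = {k: v[idx] for k, v in panel.items()}
        grand = sum(vals.values()) / len(vals)
        unit_mean = {u: sum(vals[(u, m)] for m in months) / len(months) for u in units}
        month_mean = {m: sum(vals[(u, m)] for u in units) / len(units) for m in months}
        return {(u, m): vals[(u, m)] - unit_mean[u] - month_mean[m] + grand for (u, m) in vals}

    x_t, y_t = demean(0), demean(1)
    numerator = sum(x_t[k] * y_t[k] for k in panel)
    denominator = sum(x_t[k] ** 2 for k in panel)
    return numerator / denominator


# Synthetic example: three hypothetical ADM2 units, four months, true effect of 2.
unit_effect = {"A": 1.0, "B": 5.0, "C": -2.0}
month_effect = {1: 0.0, 2: 3.0, 3: -1.0, 4: 2.0}
shocks = {"A": [0, 1, 0, 2], "B": [1, 0, 0, 1], "C": [0, 0, 3, 1]}

panel = {}
for unit, xs in shocks.items():
    for month, x in zip(sorted(month_effect), xs):
        y = unit_effect[unit] + month_effect[month] + 2.0 * x
        panel[(unit, month)] = (x, y)

print(twfe_slope(panel))
```

Because the synthetic outcome has no noise, the estimate recovers the slope of 2 up to floating-point error; real data would also need standard errors clustered at an appropriate level.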

Comparing MLP with Big Data Media Corpora

Findings

  • International media sources have limited, skewed coverage of events in developing countries
    • Event datasets that rely on global language sources will have major biases
  • Accurate data collection from national news sources requires careful human curation
    • GDELT, Common Crawl, Internet Archive include major errors

Importance of Domestic News

  • Most of our news comes from national sources
  • Domestic and international sources differ fundamentally in the types of events they cover

LexisNexis vs MLP

Countries where LexisNexis (LN) has zero local sources: 6/56

  • Albania, Belarus, Kosovo, Jamaica, Angola, South Sudan

How LN compares with MLP on other metrics (LN vs MLP):

  • Fewer languages: 17 vs 34
  • Slightly more local sources per country: 5.5 vs 5 (median; excluding MLP’s regional sources)
  • Shorter, more sporadic coverage over time

Challenges of Scraping

Scraping Case Study

Bangladesh: easiest case for automated scrapers

  • Massive volume
  • Good web architecture
  • 2/5 sources are in English

Scraping Domestic Outlets is Tough

GDELT

  • MLP: 2013 and 2015
  • GDELT: 2019 forward
  • GDELT’s best-covered source: 2,100 articles per month, compared with 2,500 per month from MLP
  • Broken links, redirects, duplicate articles, and advertising
  • Restricts requests to one search every 5 seconds, so scraping even a single source for the full time period can take several days
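The rate limit translates into wall-clock time with simple arithmetic, sketched below. The request count of 50,000 is a hypothetical figure chosen for illustration; the actual number of searches needed for a source is not given in the slides.

```python
SECONDS_PER_DAY = 86_400


def scrape_days(n_requests, seconds_per_request=5):
    """Days of continuous querying needed at one request every `seconds_per_request` seconds."""
    return n_requests * seconds_per_request / SECONDS_PER_DAY


# At one search every 5 seconds, a full day allows at most this many searches.
requests_per_day = SECONDS_PER_DAY // 5
print(requests_per_day)

# Hypothetical workload: 50,000 searches to cover one source over the full period.
print(round(scrape_days(50_000), 1), "days")
```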

Scraping Domestic Outlets is Tough

Internet Archive

  • Took nearly two weeks to collect URLs for a single source covering 2019-2023
  • Numerous irrelevant, broken, and duplicate links
  • Fewer than half of the URLs were usable

Random Audit of 5 MLP Countries

Task

  • Use algorithm to identify major events
  • Use GPT to summarize 5 most important events
  • Human check of location and event classification accuracy

Results

  • 40 events detected from April to June 2024 (300 possible)
  • Correct country: 34/40
  • Correct event: 38/40
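The audit counts above convert to accuracy rates as sketched below. The Wilson score interval is added only to illustrate how much uncertainty a sample of 40 carries; it is not part of the reported audit, and treating the 40 events as independent draws is an assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts taken from the audit results above.
audit = {"correct country": (34, 40), "correct event": (38, 40)}

for name, (correct, total) in audit.items():
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct / total:.1%} (Wilson interval {low:.1%} to {high:.1%})")
```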